Searching for Patterns of Thermostability in Proteins and Defining the Main Features Contributing to Enzyme Thermostability through Screening, Clustering, and Decision Tree Algorithms

Authors

  • M. Ebrahimi
  • E. Ebrahimie
Abstract

Finding or making thermostable enzymes is an important goal in a number of industries. Understanding the features involved in enzyme thermostability is therefore crucial, and different approaches have been used to extract or manufacture thermostable enzymes. Herein we examined features that contribute to the thermostability of 2,946 proteins. We used screening techniques (anomaly detection, feature selection), clustering methods (K-Means, TwoStep cluster), decision tree models (Classification and Regression Tree, CHAID, Exhaustive CHAID, QUEST, C5.0), and generalized rule induction (GRI) association models to search for patterns of thermostability and to find features that contribute to enzyme thermal stability. Arg occurred as the N-terminal amino acid solely in proteins working at temperatures higher than 70 °C. Feature selection modeling identified fifty-four protein features as important, and the number of peer groups with an anomaly index of 2.12 declined from 7 to 2 when anomaly detection was run using only the important selected features; however, the number of groups did not change when K-Means and TwoStep clustering were performed on datasets with or without feature selection filtering. The depth of the trees generated by the decision tree models ranged from 4 branches (in CHAID models) to 14 branches (in the C5.0 model with 10-fold cross-validation on the feature-selected dataset). Performance evaluation of the decision tree models showed that C5.0 was the best and QUEST was the worst. We did not find any significant difference in the percent of correctness, performance evaluation, or mean correctness of the decision tree models when feature-selected datasets were used, but the number of peer groups in clustering models was reduced significantly (p<0.05) compared to datasets without feature selection.
In all decision tree models, the frequency of Gln was the most important feature for the decision tree rule sets; moreover, in all 100 GRI association rules, the frequency of Gln appeared in the antecedent supporting the rule. The importance of Gln in protein thermostability is discussed in this paper.
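As a rough illustration of the two sequence attributes the abstract singles out (the frequency of Gln and the identity of the N-terminal residue), the sketch below derives them from a one-letter amino acid sequence. It is not the authors' pipeline: the abstract does not list the fifty-four selected features or name the software used, and the function name `composition_features`, the feature keys, and the toy sequence are all assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (Q = Gln, R = Arg).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence: str) -> dict:
    """Return per-residue frequencies plus the N-terminal residue.

    Hypothetical helper for illustration only; the abstract does not
    describe how the study encoded its features.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Frequency = count of the residue divided by sequence length.
    features = {f"freq_{aa}": counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}
    # First residue of the sequence, e.g. "R" for Arg.
    features["n_terminal"] = seq[0]
    return features


# Toy sequence (made up, not taken from the study's dataset).
example = composition_features("RQQAKQ")
print(example["freq_Q"], example["n_terminal"])
```

A table of such per-protein feature rows, labeled by working temperature, is the kind of input that clustering and decision tree models like those named in the abstract would typically consume.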

Similar resources

Engineering Thermostable Enzymes; Application of Unsupervised Clustering Algorithms

There is a high demand for engineering thermostable enzymes in some industries, especially in the paper industry, where environmentally friendly enzymes can replace toxic chlorine chemicals. Hence, understanding the protein attributes involved in enzyme thermostability is important. Herein, the most important protein features contributing to enzyme thermostability were searched for by using data mining algor...

An expert system to predict protein thermostability using decision tree

Protein thermostability information is closely linked to commercial production of many biomaterials. Recent developments have shown that amino acid composition, special sequence patterns and hydrogen bonds, disulfide bonds, salt bridges and so on are of considerable importance to thermostability. In this study, we present a system to integrate these various factors that predict protein thermost...

Increasing Performance and Thermostability of D-Phenylglycine Aminotransferase in Miscible Organic Solvents

Background: D-Phenylglycine aminotransferase (D-PhgAT) is highly beneficial in pharmaceutical biotechnology. Like many other enzymes, D-PhgAT suffers from low stability under harsh processing conditions, poor solubility of substrate and products, and occasional microbial contamination. Incorporation of miscible organic solvents into the enzyme’s reaction is considered as a solution...

Enhancing the Operational Properties of the Endoglucanase Enzyme through Amino Acid Substitution

Background & Aims: Ethanol produced from plant cellulose is called bioethanol and is recognized as a unique sustainable liquid fuel with powerful economic and environmental effects. In the present study we aimed to integrate a cellulase gene into the yeast genome so that the enzyme is secreted out of the cell. Subsequently cellulose is degraded to glucose by the enzyme, and then it is ferment ...

Signal processing approaches as novel tools for the clustering of N-acetyl-β-D-glucosaminidases

Nowadays, the clustering of proteins, and of enzymes in particular, is one of the most popular topics in bioinformatics. An increasing number of chitinase genes from different organisms and their sequences have been identified. So far, various mathematical algorithms have been used for the clustering of chitinase genes, but most of them seem to be confusing and sometimes insufficient. In the...

Publication date: 2009